Abstract

Probabilistic graphical models (PGMs) have become a popular tool for computational analysis of 
biological data in a variety of domains. But, what exactly are they and how do they work? How can 
we use PGMs to discover patterns that are biologically relevant? And to what extent can PGMs help 
us formulate new hypotheses that are testable at the bench? This note sketches out some answers and 
illustrates the main ideas behind the statistical approach to biological pattern discovery. 

Overview. Probabilistic graphical models offer a common conceptual architecture where biological and 
mathematical objects can be expressed with a common, intuitive formalism. This enables effective com- 
munication between scientists across the mathematical divide by fostering substantive debate in the context 
of a scientific problem, and ultimately facilitates the joint development of statistical and computational 
tools for quantitative data analysis. A number of success stories have appeared over the years (Felsenstein, 
1981; Pritchard et al., 2000; Friedman, 2004; Xing and Karp, 2004). Today, probabilistic graphical models
promise to play a major role in the resolution of many intriguing conundrums in the biological sciences. The 
goal of this short article is to be a dense, informative introduction to the language of probabilistic graphical 
models, for beginners, with pointers to successful applications in selected areas of biology. The exposition 
introduces the essential concepts involved in PGMs in the context of the various stages of a typical collabo- 
ration between natural and computational scientists, and discusses the aspects to which each scientist should 
contribute to carry out the data analysis successfully using PGMs. 

Let us start by considering a specific problem in transcriptional regulation. Given measurements about 
the abundance of gene transcripts in retinal cells across stages of development, we would like to discover 
which functional processes are relevant for development, and reveal which ones are most important at which 

Address correspondence to: Edo Airoldi, Carl Icahn Laboratory, Princeton University, Princeton, NJ 08544, USA.



E. M. Airoldi Getting started in probabilistic graphical models 

stage. To develop a PGM to address this problem, we begin by identifying the biological objects that would 
appear in a cartoon model of how cellular development impacts transcription. In this illustrative example, 
we have genes and functional processes/contexts. It is reasonable to assume that each gene will participate 
in multiple functional processes, although typically in a small number of them, and that not all functional 
processes will be important at all stages of development. We then assess what aspects of the problem 
we can probe directly, with experimental techniques, and what aspects we cannot. In the example, while an 
abundance of gene transcripts can be obtained, for instance, via SAGE (serial analysis of gene expression), it 
is harder to measure functional processes. However, the latter could be operationally defined as sets of genes 
that share a similar temporal regulation pattern; this definition has the advantage of creating a connection 
between membership of genes to functional processes (i.e., an unobservable mapping) and similarity of the 
temporal expression profiles (i.e., observable quantities). The establishment of connections between those 
biological objects that we can probe and those that we cannot ends a first conceptual effort. 

A cartoon model of how cellular development impacts transcription is now specified in terms of genes 
and their abundance, functional processes, and membership of genes to functional processes. Next we trans- 
late the biological players and the connections we established among them into mathematical quantities (i.e., 
random variables) and connections among them (i.e., statistical dependencies). This translation specifies the 
model structure. At this stage, we rely on biological intuitions to fine-tune the model, for instance, by decid- 
ing which sources of variability in the measurements carry information about the latent variables and which 
do not: if the temporal expression profiles of genes A and B are similar on a relative scale, but their absolute
abundance is quite different, should we believe that they both participate in the same functional processes? 
Last, we assign numerical values to those quantities that are unknown in the final model specifications (i.e., 
we fit the model to the data) and we use them to develop biological intuitions in the context of the original 
problem. (Functional aspects of retinal development, in mouse, are fully addressed in Airoldi et al., 2006b.) 

In the following, we briefly introduce the basic mathematical quantities that enable the translation of a 
cartoon model of biology into a PGM, and we review strategies to assign numerical values to the unknown 
quantities underlying any PGM that are most likely given the observations. We conclude with an overview 
of selected applications, complete with pointers to published work.

Figure 1: Two equivalent representations of the same probabilistic graphical model. The left panel shows
the full model, and the right panel shows the same model expressed in compact form. Nodes denote random
variables; observed random variables are shaded, while latent random variables are not. Edges denote possible
dependencies. The box in the right panel is called a plate; it denotes IID replicates.

The Basics. A probabilistic graphical model defines a family of probability distributions that can be rep- 
resented in terms of a graph. Nodes in the graph correspond to random variables; its structure translates 
into statistical dependencies (among such variables) that drive the computation of joint, conditional, and 
marginal probabilities of interest (Jordan, 2004). In applications, most of the (node-specific) random vari- 
ables are chosen to express the variability of an observed quantity, such as the expression of a specific gene 
measured under a certain condition. Some random variables, however, may specify unobserved quantities 
that are believed to influence the observable outcomes of a given experiment, such as which cellular pro- 
cesses were active at the time measurements were taken. The (directed or undirected) arcs of the graph 
specify the biological hypotheses about how observable and latent quantities influence one another. A set of 
constants underlying the distributions of the random variables completes the picture. These constants are re- 
ferred to as parameters in the frequentist paradigm and as hyper-parameters in the Bayesian paradigm. (See 
Wasserman (2004, pp. 185-189) for a discussion of when the distinction matters in practice, with examples.)

Figure 1 shows an example of a probabilistic graphical model for gene expression. (We note that there
is a considerable overlap between the class of probabilistic graphical models and the class of Bayesian
networks. A number of scholars reserve the term Bayesian networks for PGMs that can be represented as
directed acyclic graphs, with nodes corresponding to discrete-valued random variables that encode observed
measurements, and with no latent variables.) The model encodes the intuition that the observed expression of
a gene, Y(g), depends on the latent functional process it is involved in, X(g). The underlying constants,
$(\alpha, \beta)$, control the probability that any given functional process is active and the probability of observing
expression of a certain magnitude, respectively. The left panel shows the full model, and the right panel
shows the same model expressed in compact form.
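
As a minimal sketch of the generative process that Figure 1 encodes, the following Python code draws a latent process X(g) for each gene and then an observed expression Y(g) given that process. The categorical distribution for X(g), the unit-variance Gaussian for Y(g), and the names `alpha`, `beta`, and `sample_model` are illustrative assumptions; the text does not commit to specific distributions.

```python
import random

def sample_model(alpha, beta, n_genes, seed=0):
    """Draw (X, Y) from a toy version of the model in Figure 1.

    alpha: list of K probabilities, one per functional process (assumed categorical).
    beta:  list of K mean expression levels, one per process (assumed Gaussian, sd 1).
    """
    rng = random.Random(seed)
    processes = list(range(len(alpha)))
    X, Y = [], []
    for _ in range(n_genes):                           # the plate: IID replicates
        x = rng.choices(processes, weights=alpha)[0]   # latent process X(g)
        y = rng.gauss(beta[x], 1.0)                    # observed expression Y(g)
        X.append(x)
        Y.append(y)
    return X, Y

X, Y = sample_model(alpha=[0.7, 0.3], beta=[-2.0, 2.0], n_genes=1000)
```

In a real analysis only Y would be available; X is returned here so that the sketch can be checked against the values that generated the data.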

The likelihood function, or the probability of the measurements given the underlying constants, is the 
main quantity of interest in PGMs. It summarizes how well the observations are explained by the specific 
PGM that is identified by a given value of the underlying constants. The likelihood can be computed using 
the structural hypotheses encoded by the graph, and the probability distributions specified for the nodes. 
Continuing the example, the likelihood corresponding to the model in Figure 1 is computed as follows, 



\begin{align}
\Pr(Y \mid \alpha, \beta) &= \int_X \Pr(Y, X \mid \alpha, \beta)\, dX \tag{1} \\
&= \mathcal{L}(Y \mid \Theta), \tag{3}
\end{align}

for $\Theta = (\alpha, \beta)$. The joint probability of measurements and latent variables given the underlying constants,
that is, the integrand on the right-hand side of Equation 1, is often referred to as the complete likelihood
function in the literature; it is an important quantity in the statistical treatment of PGMs with latent variables.
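
To make the distinction between the likelihood and the complete likelihood concrete, here is a small sketch for the special case of a discrete latent variable, where the integral over X reduces to a sum. The categorical/Gaussian form and the function names are assumptions made for illustration, not part of the model specification in the text.

```python
import math

def log_normal_pdf(y, mean, sd=1.0):
    """Log density of a Gaussian with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (y - mean) ** 2 / (2 * sd ** 2)

def complete_log_likelihood(Y, X, alpha, beta):
    """log Pr(Y, X | alpha, beta) for one assignment of the latent variables."""
    return sum(math.log(alpha[x]) + log_normal_pdf(y, beta[x]) for y, x in zip(Y, X))

def log_likelihood(Y, alpha, beta):
    """log Pr(Y | alpha, beta): each latent X(g) is summed out, gene by gene."""
    total = 0.0
    for y in Y:
        terms = [math.log(a) + log_normal_pdf(y, b) for a, b in zip(alpha, beta)]
        m = max(terms)  # log-sum-exp for numerical stability
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total

Y = [-2.1, -1.7, 2.3]
alpha, beta = [0.7, 0.3], [-2.0, 2.0]
print(log_likelihood(Y, alpha, beta))
```

The gene-by-gene factorization used in `log_likelihood` relies on the IID assumption expressed by the plate; summing the complete likelihood over every joint assignment of X gives the same value, but at a cost that grows exponentially with the number of genes.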



Estimation and Inference. A family of PGMs is fit to the data to find likely values for its underlying 
constants and likely distributions for its latent variables. This process boils down to an optimization problem 
where the objective function is based on the likelihood. Considered jointly, the estimation and inference 
tasks identify a specific model, within the family of PGMs defined by the assumptions on the graph and
the random variables, that successfully summarizes the variability of the observations.

In the language of the statistical literature, we distinguish the task of estimating the underlying constants 
(i.e., the parameters in a frequentist statistical setting, or the hyper-parameters in a Bayesian statistical 
setting) of a probabilistic graphical model, from the task of inferring the distributions of the latent variables 
given the observations. Let us consider strategies to address the latter task first. The choice among the 
many strategies available is often informed by the complexity of the model, and in particular by whether the 
integral on the right-hand side of Equation 1 can be computed in closed form. Exact inference is available
for models that belong to special families (Jordan, 2004). Focusing on the biology of the problem, however, 
often leads to a model structure and probabilistic specifications that cannot be subsumed under any special 
family. The likelihood is intractable in many such cases, that is, the integral in Equation 1 cannot be
solved in closed form, and we resort to approximations. Below, we briefly survey the intuitions behind
three popular strategies to perform approximate inference in PGMs: Markov chain Monte Carlo (sampling-based),
and expectation-maximization (EM) and variational methods (optimization-based).

Markov chain Monte Carlo (MCMC) techniques such as the Gibbs or Metropolis-Hastings samplers
can be used to explore the joint posterior distribution of the latent variables (Gelman et al., 1995; Robert and
Casella, 2005). Although the likelihood is intractable, the complete likelihood $\Pr(Y, X \mid \alpha, \beta)$ can be
easily computed for the large majority of PGMs. The main concept behind MCMC schemes is to work with
the complete likelihood, and reduce the full joint posterior to lower-dimensional conditional distributions,
on individual latent variables or blocks of them, that we can sample from. Samples from the joint posterior are then
obtained by composing conditional samples. The Gibbs sampler, for instance, requires that one can sample
from all univariate, full-conditional distributions,

\begin{equation}
\Pr\left(X(g) \mid X_{(-g)}, Y, \alpha, \beta\right), \quad \text{for } g = 1, \dots, G, \tag{4}
\end{equation}

where $X_{(-g)}$ is the collection of random variables X without X(g). The Metropolis-Hastings sampler
requires that one can at least compute a quantity proportional to the desired posterior: samples are drawn
from an arbitrary proposal distribution and are accepted or rejected using a formula that depends on the
proposal. Other sampling-based algorithms such as particle filters can be used to perform inference in
PGMs of sequential observations (Liu, 2001).
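
The accept/reject idea behind Metropolis-Hastings can be sketched in a few lines. The example below is a random-walk variant with a symmetric Gaussian proposal, so the proposal terms cancel in the acceptance ratio; it is a toy illustration under assumptions of our own (a one-dimensional latent variable with a standard Gaussian prior and one unit-variance Gaussian observation), not the sampler used in any of the cited applications.

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_target only needs to be known up to an additive constant.
    """
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)              # symmetric proposal
        log_ratio = log_target(proposal) - log_target(x)
        if rng.random() < math.exp(min(0.0, log_ratio)): # accept with prob min(1, ratio)
            x = proposal
        samples.append(x)                                # a rejection repeats the current state
    return samples

y_obs = 1.0  # hypothetical single observation

def log_target(x):
    """Unnormalized log posterior: N(0, 1) prior on x, N(x, 1) likelihood for y_obs."""
    return -0.5 * x ** 2 - 0.5 * (y_obs - x) ** 2

samples = metropolis_hastings(log_target, x0=0.0, n_samples=20000)
kept = samples[2000:]                                    # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

For this conjugate toy problem the exact posterior is Gaussian with mean y_obs / 2, which gives a simple check on the sampler's output.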

The two alternatives to sampling we survey here aim at approximating the integral on the right-hand side of
Equation 1. The main idea shared by both approaches is to find a lower bound for the likelihood, $\mathcal{L}(Y \mid \Theta)$,
making use of Jensen's inequality and of an arbitrary distribution on the latent variables, $q(X)$,

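\begin{align}
\log \mathcal{L}(Y \mid \Theta) &= \log \int_X \Pr(Y, X \mid \Theta)\, dX \nonumber \\
&= \log \int_X q(X)\, \frac{\Pr(Y, X \mid \Theta)}{q(X)}\, dX && \text{(for any } q\text{)} \nonumber \\
&\geq \int_X q(X) \log \frac{\Pr(Y, X \mid \Theta)}{q(X)}\, dX && \text{(Jensen's inequality)} \nonumber \\
&= E_q\left[ \log \Pr(Y, X \mid \Theta) - \log q(X) \right] \triangleq \mathcal{L}(q, \Theta) \tag{5}
\end{align}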
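
The bound can be checked numerically on a toy case. The sketch below assumes a single observation and a discrete latent variable with two states, so that the integral is a two-term sum; the Gaussian emission and all names are illustrative assumptions. It computes the gap between the log-likelihood and the lower bound for an arbitrary q and for the posterior.

```python
import math

def log_joint(y, alpha, beta):
    """log Pr(Y = y, X = k | Theta) for each value k of a discrete latent X.

    Assumes a unit-variance Gaussian for Y given X = k (illustrative only).
    """
    return [math.log(a) - 0.5 * math.log(2 * math.pi) - 0.5 * (y - b) ** 2
            for a, b in zip(alpha, beta)]

def lower_bound(q, lj):
    """E_q[log Pr(Y, X | Theta) - log q(X)]; states with q(X) = 0 contribute nothing."""
    return sum(qk * (ljk - math.log(qk)) for qk, ljk in zip(q, lj) if qk > 0)

lj = log_joint(y=0.5, alpha=[0.7, 0.3], beta=[-2.0, 2.0])
log_lik = math.log(sum(math.exp(v) for v in lj))       # the integral becomes a sum
posterior = [math.exp(v - log_lik) for v in lj]        # Pr(X | Y, Theta)

gap_arbitrary = log_lik - lower_bound([0.5, 0.5], lj)  # nonnegative for any q
gap_posterior = log_lik - lower_bound(posterior, lj)   # zero up to rounding
```

The gap equals the Kullback-Leibler divergence from q to the posterior, which is why it vanishes exactly when q is the posterior.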

In EM, the lower bound $\mathcal{L}(q, \Theta)$ is then iteratively maximized with respect to $\Theta$, in the M step, and $q$ in the
E step (Dempster et al., 1977). In particular, at the t-th iteration of the E step the q distribution must satisfy
the following equation:

\begin{equation}
q^{(t)} = \Pr\left(X \mid Y, \Theta^{(t-1)}\right). \tag{6}
\end{equation}


That is, we set the arbitrary distribution q equal to the posterior distribution of the latent variables given the
data and the estimates of the parameters at the previous iteration. Unfortunately, it is not always possible to
express the distribution $q^{(t)}$ in Equation 6 in analytic form. In such cases, a variational approximation to the
EM algorithm (Jordan et al., 1999) can be obtained by defining a parametric approximation to the posterior
in Equation 6, denoted by $q = q_\Delta(X)$, which involves an extra set of variational parameters, $\Delta$, and leads
to an approximate lower bound for the likelihood, $\mathcal{L}_\Delta(q, \Theta)$. At the t-th iteration of the E step, we then
minimize the Kullback-Leibler divergence between $q_\Delta$ and $q^{(t)}$, with respect to $\Delta$, using the data; this is
equivalent to maximizing the approximate lower bound for the likelihood, $\mathcal{L}_\Delta(q, \Theta)$, with respect to $\Delta$. The
optimal parametric approximation can be thought of as an approximate posterior distribution for the latent
variables in the sense that it depends on the data Y, although indirectly,

\begin{equation*}
q_{\Delta}^{*} = q_{\Delta^{*}(Y)}(X) \approx \Pr(X \mid Y).
\end{equation*}

Let us now return to the task of estimating the constants underlying a PGM; few established strategies
exist. The estimates for the underlying constants may be chosen, for instance, to maximize the likelihood,
or to match empirical and theoretical moments of the random variables that correspond to measurements
(Wasserman, 2004, pp. 120-124). Alternatively, when the likelihood is too difficult or expensive to compute,
an approximation, $\mathcal{L}_\Delta(q, \Theta) \approx \mathcal{L}(Y \mid \Theta)$, or a lower bound, $\mathcal{L}(q, \Theta) \leq \mathcal{L}(Y \mid \Theta)$, for the likelihood
can be used as a surrogate. These alternatives and others are sometimes referred to as empirical Bayes
estimates in the context of non-trivial probabilistic graphical models (Carlin and Louis, 2005, Chapter 3).
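
The interplay between inference on the latent variables (E step) and estimation of the underlying constants (M step) can be sketched with EM on a toy model. The example assumes a two-component mixture with unit-variance Gaussian components, for which the posterior in the E step is available in closed form; the model, the initialization, and the function name `em_mixture` are assumptions made for illustration.

```python
import math
import random

def em_mixture(Y, n_iter=50):
    """EM for a two-component Gaussian mixture with unit variances (toy assumption)."""
    alpha = [0.5, 0.5]
    beta = [min(Y), max(Y)]              # crude initialization
    for _ in range(n_iter):
        # E step: q(X(g) = k) = Pr(X(g) = k | Y(g), previous alpha and beta)
        q = []
        for y in Y:
            w = [a * math.exp(-0.5 * (y - b) ** 2) for a, b in zip(alpha, beta)]
            s = sum(w)
            q.append([wk / s for wk in w])
        # M step: maximize the lower bound with respect to (alpha, beta)
        for k in range(2):
            nk = sum(qg[k] for qg in q)
            alpha[k] = nk / len(Y)
            beta[k] = sum(qg[k] * y for qg, y in zip(q, Y)) / nk
    return alpha, beta

# Simulated data: 700 genes from one process, 300 from the other.
rng = random.Random(0)
Y = [rng.gauss(-2.0, 1.0) for _ in range(700)] + [rng.gauss(2.0, 1.0) for _ in range(300)]
alpha_hat, beta_hat = em_mixture(Y)
```

When the E-step posterior has no closed form, the inner loop over `q` would be replaced by an optimization over variational parameters, as described in the text.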

Popular software packages that implement a language to specify and fit PGMs are available. For MCMC 
see BUGS (Lunn et al., 2000); for variational inference see VIBES (Bishop et al., 2003). 

Applications. With the technical machinery we just introduced, we are now ready to bring the biological 
intuition back into the picture. Let us continue with the transcriptional regulation example. In the PGM 
of Figure 1, the expression of gene g may be encoded by a real-valued random variable Y(g). The mixed
membership of gene g to non-observable biological contexts may be encoded by the nonzero components of 
a latent random vector, X(g). The number of latent biological contexts we ask the PGM to infer, denoted by 
K, is an important quantity in this model, which we discuss later — briefly, K specifies the dimensionality 
of this PGM, that is, the number of components of the vector-valued latent variables, X(g). The two 
constants $(\alpha, \beta)$ may be used to encode biological constraints. For instance, $\alpha$ may be used to introduce a
notion of biological parsimony in the form of a probabilistic (soft) constraint on the number of biological
contexts each gene may participate in, and $\beta$ may be used to specify gene expression patterns in the form
of differential expression levels across those experimental conditions for which microarray measurements
were taken; alternative pattern specifications and parameterizations exist (Airoldi et al., 2006b). For any
given number of latent biological contexts, K, the PGM is fit to the data. Estimation and inference will
assign numerical values to the unknown quantities $(X, \alpha, \beta)$. These quantities provide us with model-based
and observation-induced summaries of the data. In the example, for instance, while $\beta$ captures
gene expression patterns that summarize the main trends of transcription in a collection of microarrays, the
values assigned to the latent variables, X(g), provide gene-specific information that can be used for making 
fine-grained predictions.

In the last stage of the analysis, we assess the biological relevance of the patterns we inferred from
the data (such as the biological contexts, or gene co-expression patterns, in the example) to make sure the
model is capturing the signal we set out to capture, and we use the inferred patterns to gain insights into the
problem. Assessment of biological relevance can be qualitative or quantitative. Qualitative methods such
as visual inspection are typically useful for focused scientific endeavors; for instance, whenever the biological
problem targets a small set of genes, a specific cellular process or component, or a signaling pathway.
Quantitative methods are necessary for genome-wide scientific endeavors, and typically rely on knowledge-based
repositories and ontologies (such as gene ontology, Ashburner et al., 2000) and bioinformatics tools to
carry out the evaluation (e.g., Boyle et al., 2004; Myers et al., 2006). Arguably, in any given application, the 
more interpretable the patterns are, in terms of functional processes and other biological concepts of interest, 
the better the family of PGMs captures some aspects of biology that may be relevant for the understanding 
of the phenomenon under investigation, and that are not directly measurable with experimental techniques. 

Moving a step forward, the goodness of model fit is often taken as a measure of how well the data support 
structural biological hypotheses encoded by the cartoon model of biology that was used to posit a given 
family of PGMs. Measures of goodness of model fit include the Bayesian information criterion, the held- 
out likelihood obtained using bootstrap or cross-validation techniques, measures of predictive power such 
as the predictive $R^2$ in linear regression, or other quantities, depending on the goals of the analysis. (These
measures can also be used to select the dimensionality, K, of the PGM in the example.) The goodness of 
fit, along with the substantive value of the inferred patterns, should inform a critical review of the biological 
assumptions underlying the initial cartoon model, and possibly suggest new hypotheses — testable either 
with new statistical analyses, or with new experimental probes at the bench. In this sense, probabilistic 
graphical models contribute to an iterative process of scientific discovery, where statistical and biological 
thinking are intertwined as both cause and effect. 
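
As one example of how such a measure can guide the choice of K, the sketch below computes the Bayesian information criterion in its common form, minus twice the maximized log-likelihood plus the number of free parameters times the log of the number of observations, and picks the K with the lowest value. The function names are hypothetical and the numbers in `fits` are made up for illustration; they are not results from any data set.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better trade-off."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def select_k(fits, n_obs):
    """fits maps each candidate K to (maximized log-likelihood, number of free parameters)."""
    return min(fits, key=lambda k: bic(fits[k][0], fits[k][1], n_obs))

# Made-up numbers for illustration only.
fits = {1: (-2100.0, 1), 2: (-1650.0, 3), 3: (-1648.0, 5)}
best_k = select_k(fits, n_obs=1000)
```

In this toy input, moving from K = 2 to K = 3 improves the log-likelihood too little to offset the penalty for the two extra parameters, so the smaller model is preferred.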

There is a rich history of applied research that leverages the probabilistic graphical models approach 
outlined above to problems in the biological sciences. It includes a model for inferring the ancestral pop- 
ulation structure of individuals starting from a collection of multilocus genotype measurements (Pritchard 
et al., 2000; Sohn and Xing, 2007) and a model for inferring HIV mutation patterns from longitudinal clonal 
sequence data (Beerenwinkel and Drton, 2007); the former model is closely related to the classic probabilistic
graphical models to infer phylogenetic trees (Felsenstein, 1981; Felsenstein and Churchill, 1996) and to
recent extensions, in particular, that take into account the dependence among the bases at neighboring sites 
(McAuliffe et al., 2004; Siepel and Haussler, 2004). Models for sequence analysis are well established in 
the community (Durbin et al., 1998; Xing and Karp, 2004); more recently, the connection between sequence 
information and gene expression has been investigated using probabilistic graphical models as well (Segal 
et al., 2003; Beer and Tavazoie, 2004). Other applications of this research include: a model for predicting 
the clinical status of breast cancer using gene expression profiles (West et al., 2001); a model for facilitating
content browsing of biomedical literature about the nematode Caenorhabditis elegans (Blei et al., 2006);
a model for inferring the location of chromosome aberrations from array-based comparative genomic
hybridization measurements (Myers et al., 2004); an extension that leverages array-based comparative
genomic hybridization profiles from multiple individuals to recover shared aberration patterns (Shah et al., 2007);
a model for reconstructing features of the internal organization of the cell from the nested structure of
observed perturbation effects, such as those measured via high-dimensional phenotype screens (Markowetz 
et al., 2007); a model for inferring proteins' multiple functional roles from a large collection of manually 
curated protein interactions, as well as cross-talk patterns among proteins that participate in distinct func- 
tional processes (Airoldi et al., 2006a); and a model for inferring temporal patterns of coexpressed genes 
from time-course expression data measured via SAGE and microarray technologies (Airoldi et al., 2006b). 

Note that the graphical representation of a family of PGMs goes only so far in specifying the model; it is
informative, but not exhaustive. Probabilistic assumptions and some features of the sampling scheme cannot
be specified by the graph. Such subtle variants typically make a significant difference in applications. 

Conclusions. Probabilistic graphical models offer a common conceptual architecture where biological and 
mathematical objects can be expressed with a common, intuitive formalism. This enables effective commu- 
nication between scientists across the mathematical divide by fostering substantive debate in the context of a 
scientific problem, and ultimately facilitates the joint development of statistical and computational tools for 
quantitative data analysis. In other words, probabilistic graphical models provide a bridge between biology 
and statistical computations. These models recently earned a spot at the center stage of modern (computational)
biology by furthering our ability to probe data for biological hypotheses, and will undoubtedly play
an important role in resolving many intriguing conundrums in the biological sciences in the future.



Acknowledgments. This research was partly supported by United States National Institute of General 
Medical Sciences Center of Excellence grant P50 GM071508, by National Science Foundation grants DBI- 
0546275 and IIS-0513552, and by National Institutes of Health grant R01 GM071966. The author thanks 
Florian Markowetz, Chad Myers, David Hess, and Olga Troyanskaya at Princeton, and Eric Xing at Carnegie 
Mellon, for comments on an early draft of this manuscript. 

Essential Bibliography. 

E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership analysis of high-throughput
interaction studies: relational data. Manuscript, 2006a. URL http://arxiv.org/abs/0706.0294/.

E. M. Airoldi, S. E. Fienberg, and E. P. Xing. Mixed membership analysis of genome-wide expression
studies: attribute data. Manuscript, 2006b. URL http://arxiv.org/abs/0712.xxxx/.

M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. 
Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. 
Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: Tool for the unification of
biology. The gene ontology consortium. Nature Genetics, 25(1):25-29, 2000.

M. A. Beer and S. Tavazoie. Predicting gene expression from sequence. Cell, 117:185-198, 2004. 

N. Beerenwinkel and M. Drton. A mutagenetic tree hidden Markov model for longitudinal clonal HIV 
sequence data. Biostatistics, 8(1):53-71, 2007. 

C. Bishop, D. Spiegelhalter, and J. Winn. VIBES: A variational inference engine for Bayesian networks. In 
S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, 
pages 777-784. MIT Press, Cambridge, MA, 2003. URL http://vibes.sourceforge.net/.

D. M. Blei, K. Franks, M. I. Jordan, and I. S. Mian. Statistical modeling of biomedical corpora: Mining the 
Caenorhabditis genetic center bibliography for genes related to life span. BMC Bioinformatics, 7(250), 
2006. 


E. I. Boyle, S. Weng, J. Gollub, H. Jin, D. Botstein, J. M. Cherry, and G. Sherlock. GO::TermFinder — open 
source software for accessing Gene Ontology terms associated with a list of genes. Bioinformatics, 20 
(18):3710-3715, 2004. 

B. P. Carlin and T. A. Louis. Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall, 
second edition, 2005. 

A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. 
Journal of the Royal Statistical Society, Series B, 39:1-38, 1977. 

R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic models of 
proteins and nucleic acids. Cambridge University Press, 1998. 

J. Felsenstein. Evolutionary trees from DNA sequences. Journal of Molecular Evolution, 17:368-376, 1981.

J. Felsenstein and G. A. Churchill. A hidden Markov model approach to variation among sites in rate of 
evolution. Molecular Biology and Evolution, 13:93-104, 1996. 

N. Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303:799-805, 2004. 

A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian data analysis. Chapman & Hall, London, 1995. 

M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for graphical 
models. Machine Learning, 37:183-233, 1999. 

M. I. Jordan. Graphical models. Statistical Science, 19(1):140-155, 2004.

J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, 2001.

D. J. Lunn, A. Thomas, N. G. Best, and D. J. Spiegelhalter. WinBUGS: A Bayesian modelling framework:
concepts, structure and extensibility. Statistics and Computing, 10:321-333, 2000.
URL http://www.mrc-bsu.cam.ac.uk/bugs/.

F. Markowetz, D. Kostka, O. G. Troyanskaya, and R. Spang. Nested effects models for high-dimensional 
phenotyping screens. Bioinformatics, 2007. In press. 




J. D. McAuliffe, L. Pachter, and M. I. Jordan. Multiple-sequence functional annotation and the generalized 
hidden Markov phylogeny. Bioinformatics, 20(12):1850-1860, 2004.

C. L. Myers, M. J. Dunham, S. Y. Kung, and O. G. Troyanskaya. Accurate detection of aneuploidies in array 
CGH and gene expression microarray data. Bioinformatics, 20(18):3533-3543, 2004. 

C. L. Myers, D. A. Barret, M. A. Hibbs, C. Huttenhower, and O. G. Troyanskaya. Finding function: An 
evaluation framework for functional genomics. BMC Genomics, 7(187), 2006. 

J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype 
data. Genetics, 155:945-959, 2000. 

C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag,
New York, NY, corrected second edition, 2005. 

E. Segal, R. Yelensky, and D. Koller. Genome-wide discovery of transcriptional modules from DNA sequence
and gene expression. Bioinformatics, 19(Suppl. 1):i273-i282, 2003.

S. P. Shah, W. L. Lam, R. T. Ng, and K. P. Murphy. Modeling recurrent DNA copy number alterations in
array CGH data. Bioinformatics, 2007. In press. 

A. Siepel and D. Haussler. Combining phylogenetic and hidden Markov models in biosequence analysis. 
Journal of Computational Biology, 11(2-3):413-428, 2004.

K. Sohn and E. P. Xing. Spectrum: Joint Bayesian inference of population structure and recombination
event. Bioinformatics, 2007. In press. 

L. Wasserman. All of Statistics. Springer-Verlag, 2004.

M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson, Jr., J. R. Marks, 
and J. R. Nevins. Predicting the clinical status of human breast cancer by using gene expression profiles. 
Proceedings of the National Academy of Sciences, 98(20):11462-11467, 2001.

E. P. Xing and R. M. Karp. MotifPrototyper: A profile Bayesian model for motif family. Proceedings of the
National Academy of Sciences, 101:10523-10528, 2004. 






